Gambler's Ruin Bandit Problem
In this paper, we propose a new multi-armed bandit problem called the
Gambler's Ruin Bandit Problem (GRBP). In the GRBP, the learner proceeds in a
sequence of rounds, where each round is a Markov Decision Process (MDP) with
two actions (arms): a continuation action that moves the learner randomly over
the state space around the current state; and a terminal action that moves the
learner directly into one of the two terminal states (goal and dead-end state).
The current round ends when a terminal state is reached, and the learner receives a positive reward only when the goal state is reached. The objective of the learner is to maximize its long-term reward (the expected number of times the goal state is reached), without any prior knowledge of the state transition
probabilities. We first prove a result on the form of the optimal policy for
the GRBP. Then, we define the regret of the learner with respect to an
omnipotent oracle, which acts optimally in each round, and prove that it
increases logarithmically over rounds. We also identify a condition under which
the learner's regret is bounded. A potential application of the GRBP is optimal
medical treatment assignment, in which the continuation action corresponds to a
conservative treatment and the terminal action corresponds to a risky treatment
such as surgery.
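To make the setup concrete, here is a minimal simulation sketch of one GRBP round, assuming a simple birth-death walk for the continuation action and a two-point jump for the terminal action; the state space, the probabilities, and the particular threshold rule below are illustrative assumptions, not the paper's exact model.

```python
import random

# Illustrative GRBP round: states 0..N, where 0 is the dead-end and N the goal.
# All constants here are assumptions for the sketch, not the paper's model.
N = 10             # goal state
P_UP = 0.55        # continuation action: probability of moving one state up
P_TERM_GOAL = 0.4  # terminal action: probability of jumping straight to the goal

def play_round(policy, start=5):
    """Play one round; return 1 if the goal state is reached, else 0."""
    s = start
    while 0 < s < N:
        if policy(s) == "continue":            # random local move
            s += 1 if random.random() < P_UP else -1
        else:                                  # jump to goal or dead-end
            return 1 if random.random() < P_TERM_GOAL else 0
    return 1 if s == N else 0

# The paper characterizes the form of the optimal policy; a threshold rule
# (continue in high states, terminate in low ones) is one natural candidate.
threshold_policy = lambda s: "continue" if s >= 4 else "terminate"
wins = sum(play_round(threshold_policy) for _ in range(10_000))
print(f"goal reached in {wins / 10_000:.2%} of rounds")
```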
Inverse Prism based on Temporal Discontinuity and Spatial Dispersion
We introduce the concept of the inverse prism as the dual of the conventional
prism and deduce from this duality an implementation based on temporal discontinuity and spatial dispersion provided by anisotropy. Moreover, we show
that this inverse prism exhibits the following three unique properties:
chromatic refraction birefringence, ordinary-monochromatic and extraordinary-
polychromatic temporal refraction, and linear-to-Lissajous polarization
transformation.
Two families of indexable partially observable restless bandits and Whittle index computation
We consider restless bandits with a general state space under partial observability, with two observational models: first, the state of each bandit is
not observable at all, and second, the state of each bandit is observable only
if it is chosen. We assume both models satisfy the restart property, under which
we prove indexability of the models and propose the Whittle index policy as the
solution. For the first model, we derive a closed-form expression for the
Whittle index. For the second model, we propose an efficient algorithm to
compute the Whittle index by exploiting the qualitative properties of the
optimal policy. We present detailed numerical experiments for multiple instances of a machine maintenance problem. The results indicate that the Whittle index policy outperforms the myopic policy and can be close to optimal in different setups.
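As an illustration of how the Whittle index policy acts once the per-arm indices are available, the following sketch activates the arms with the largest current indices; the index table here is hypothetical, standing in for the closed-form expression or computed values described in the abstract.

```python
import numpy as np

def whittle_policy(whittle_index, states, m):
    """Activate the m arms whose current Whittle indices are largest."""
    indices = np.array([whittle_index[arm][s] for arm, s in enumerate(states)])
    return np.argsort(indices)[-m:]           # arms to activate this step

# Hypothetical example: 4 arms, 3 states each, activate m = 2 arms per step.
table = np.array([[0.1, 0.5, 0.9],
                  [0.2, 0.4, 0.6],
                  [0.0, 0.3, 0.8],
                  [0.1, 0.2, 0.7]])
print(whittle_policy(table, states=[2, 0, 1, 2], m=2))   # -> arms 3 and 0
```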
Approximate information state based convergence analysis of recurrent Q-learning
In spite of the large literature on reinforcement learning (RL) algorithms
for partially observable Markov decision processes (POMDPs), a complete
theoretical understanding is still lacking. In a partially observable setting, the history of data available to the agent increases over time, so most practical algorithms either truncate the history to a finite window or compress it using a recurrent neural network, leading to an agent state that is
non-Markovian. In this paper, it is shown that in spite of the lack of the
Markov property, recurrent Q-learning (RQL) converges in the tabular setting.
Moreover, it is shown that the quality of the converged limit depends on the
quality of the representation which is quantified in terms of what is known as
an approximate information state (AIS). Based on this characterization of the
approximation error, a variant of RQL with AIS losses is presented. This
variant performs better than a strong baseline for RQL that does not use AIS
losses. It is demonstrated that there is a strong correlation between the
performance of RQL over time and the loss associated with the AIS
representation.
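The following is a minimal sketch of tabular recurrent Q-learning in the spirit described above: a fixed recurrent update compresses the history into an agent state (here, a window of the last two observations), and standard Q-learning runs on that non-Markovian state. The toy environment, window length, and hyperparameters are illustrative assumptions; the paper's AIS losses are not reproduced here.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1]
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1
Q = defaultdict(float)   # tabular Q-values indexed by (agent_state, action)

def sigma(z, obs):
    """Recurrent agent-state update: keep a window of the last 2 observations."""
    return (z + (obs,))[-2:]

def rql_step(z, obs, env_step):
    """One recurrent Q-learning step from agent state z and fresh observation obs."""
    z = sigma(z, obs)
    a = (random.choice(ACTIONS) if random.random() < EPS
         else max(ACTIONS, key=lambda b: Q[z, b]))
    obs2, r = env_step(a)                           # environment responds
    z2 = sigma(z, obs2)
    target = r + GAMMA * max(Q[z2, b] for b in ACTIONS)
    Q[z, a] += ALPHA * (target - Q[z, a])           # Q-learning update on (z, a)
    return z, obs2

# Toy partially observed two-state chain: reward when the noisy observation is 1.
_state = [0]
def toy_env(a):
    _state[0] = (_state[0] + a) % 2
    obs = _state[0] if random.random() < 0.8 else 1 - _state[0]
    return obs, float(obs == 1)

z, obs = (), 0
for _ in range(5_000):
    z, obs = rql_step(z, obs, toy_env)
```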
The global burden of cancer attributable to risk factors, 2010-19: a systematic analysis for the Global Burden of Disease Study 2019
Background: Understanding the magnitude of cancer burden attributable to potentially modifiable risk factors is crucial for the development of effective prevention and mitigation strategies. We analysed results from the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2019 to inform cancer control planning efforts globally.

Methods: The GBD 2019 comparative risk assessment framework was used to estimate cancer burden attributable to behavioural, environmental and occupational, and metabolic risk factors. A total of 82 risk-outcome pairs were included on the basis of the World Cancer Research Fund criteria. Estimated cancer deaths and disability-adjusted life-years (DALYs) in 2019 and the change in these measures between 2010 and 2019 are presented.

Findings: Globally, in 2019, the risk factors included in this analysis accounted for 4.45 million (95% uncertainty interval 4.01-4.94) deaths and 105 million (95.0-116) DALYs for both sexes combined, representing 44.4% (41.3-48.4) of all cancer deaths and 42.0% (39.1-45.6) of all DALYs. There were 2.88 million (2.60-3.18) risk-attributable cancer deaths in males (50.6% [47.8-54.1] of all male cancer deaths) and 1.58 million (1.36-1.84) risk-attributable cancer deaths in females (36.3% [32.5-41.3] of all female cancer deaths). The leading risk factors at the most detailed level globally for risk-attributable cancer deaths and DALYs in 2019 for both sexes combined were smoking, followed by alcohol use and high BMI. Risk-attributable cancer burden varied by world region and Socio-demographic Index (SDI), with smoking, unsafe sex, and alcohol use being the three leading risk factors for risk-attributable cancer DALYs in low SDI locations in 2019, whereas DALYs in high SDI locations mirrored the top three global risk factor rankings. From 2010 to 2019, global risk-attributable cancer deaths increased by 20.4% (12.6-28.4) and DALYs by 16.8% (8.8-25.0), with the greatest percentage increase in metabolic risks (34.7% [27.9-42.8] and 33.3% [25.8-42.0]).

Interpretation: The leading risk factors contributing to global cancer burden in 2019 were behavioural, whereas metabolic risk factors saw the largest increases between 2010 and 2019. Reducing exposure to these modifiable risk factors would decrease cancer mortality and DALY rates worldwide, and policies should be tailored appropriately to local cancer risk factor burden.

Copyright (C) 2022 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.
Online learning in Markov decision processes with special structure (Özel yapılı Markov karar süreçlerinde çevrimiçi öğrenme)
Thesis (M.S.): İhsan Doğramacı Bilkent University, Department of Electrical and Electronics Engineering, 2017, by Nima Akbarzadeh. Includes bibliographical references (leaves 80-86). This thesis proposes three new multi-armed bandit problems, in which the learner
proceeds in a sequence of rounds where each round is a Markov Decision Process
(MDP). The learner's goal is to maximize its cumulative reward without any a priori knowledge of the state transition probabilities. The first problem considers an MDP with sorted states, a continuation action that moves the learner to an adjacent state, and a terminal action that moves the learner to a terminal state
(goal or dead-end state). In this problem, a round ends and the next round starts
when a terminal state is reached, and the aim of the learner in each round is to
reach the goal state. First, the structure of the optimal policy is derived. Then,
the regret of the learner with respect to an oracle, which takes optimal actions in each round, is defined, and a learning algorithm that exploits the structure of the
optimal policy is proposed. Finally, it is shown that the regret either increases
logarithmically over rounds or becomes bounded. In the second problem, we
investigate the personalization of a clinical treatment. This process is modeled
as a goal-oriented MDP with dead-end states. Moreover, the state transition probabilities of the MDP depend on the context of the patient. An algorithm that uses the rule of optimism in the face of uncertainty is proposed to maximize the
number of rounds in which the goal state is reached. In the third problem, we
propose an online learning algorithm for optimal execution in the limit order book
of a financial asset. Given a certain amount of shares to sell and an allocated time to complete the transaction, the proposed algorithm dynamically learns the
optimal number of shares to sell at each time slot of the allocated time. We model
this problem as an MDP, and derive the form of the optimal policy.
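The execution MDP in the third problem can be made concrete with a known-model sketch: backward induction over (time slot, remaining inventory), choosing how many shares to sell now. The linear price-impact revenue function below is an assumption for illustration; the thesis instead learns the unknown dynamics online.

```python
# Assumed-known-model version of the execution MDP, solved by backward
# induction. State: (time slot t, remaining shares w); action: shares q to
# sell now. revenue() encodes a hypothetical linear price impact.
T, W = 5, 20   # time slots and total shares to sell (illustrative values)

def revenue(q):
    return q * (10.0 - 0.1 * q)   # concave: selling faster depresses the price

V = {(T, w): revenue(w) for w in range(W + 1)}   # remaining shares sold at the end
policy = {}
for t in reversed(range(T)):
    for w in range(W + 1):
        q, val = max(((q, revenue(q) + V[t + 1, w - q]) for q in range(w + 1)),
                     key=lambda pair: pair[1])
        policy[t, w], V[t, w] = q, val

print("optimal first-slot sale:", policy[0, W], "value:", round(V[0, W], 2))
```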
On learning Whittle index policy for restless bandits with scalable regret
Reinforcement learning is an attractive approach to learn good resource
allocation and scheduling policies based on data when the system model is
unknown. However, the cumulative regret of most RL algorithms scales as $\tilde{\mathcal{O}}(S\sqrt{AT})$, where $S$ is the size of the state space, $A$ is the size of the action space, $T$ is the horizon, and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms. Due to the linear
dependence on the size of the state space, these regret bounds are
prohibitively large for resource allocation and scheduling problems. In this
paper, we present a model-based RL algorithm for such problems which has
scalable regret. In particular, we consider a restless bandit model, and
propose a Thompson-sampling based learning algorithm which is tuned to the
underlying structure of the model. We present two characterizations of the
regret of the proposed algorithm with respect to the Whittle index policy.
First, we show that for a restless bandit with $n$ arms and at most $m$ activations at each time, the regret scales either as $\tilde{\mathcal{O}}(mn\sqrt{T})$ or $\tilde{\mathcal{O}}(n^2\sqrt{T})$ depending on the reward model. Second, under an additional technical assumption, we show that the regret scales as $\tilde{\mathcal{O}}(\max\{m\sqrt{n}, n\}\sqrt{T})$. We present numerical examples to illustrate the
salient features of the algorithm.
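The following sketch shows the shape of one Thompson-sampling episode for this setting, assuming two-state arms with Beta posteriors over each arm's transition probabilities; compute_whittle_indices and run_episode are hypothetical placeholders for the index computation and the episode rollout, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_episode(counts, compute_whittle_indices, run_episode):
    """One learning episode: sample a model from the posterior, then act by
    the Whittle index policy of the sampled model and update the posterior."""
    # Beta posterior sample of each arm's chance of moving to state 1,
    # with one (success, failure) count pair per (arm, current state).
    sampled = [[rng.beta(a + 1, b + 1) for (a, b) in arm] for arm in counts]
    indices = compute_whittle_indices(sampled)   # indices for the sampled model
    transitions = run_episode(indices)           # rollout: (arm, state, next_state)
    for arm, s, s2 in transitions:               # conjugate posterior update
        a, b = counts[arm][s]
        counts[arm][s] = (a + s2, b + 1 - s2)
    return counts
```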